roc auc score
Large Language Models as Attribution Regularizers for Efficient Model Training
Vukadin, Davor, Šilić, Marin, Delač, Goran
Large Language Models (LLMs) have demonstrated remarkable performance across diverse domains. However, effectively leveraging their vast knowledge for training smaller downstream models remains an open challenge, especially in domains like tabular data learning, where simpler models are often preferred due to interpretability and efficiency. In this paper, we introduce a novel yet straightforward method for incorporating LLM-generated global task feature attributions into the training process of smaller networks. Specifically, we propose an attribution-matching regularization term that aligns the training dynamics of the smaller model with the insights provided by the LLM. By doing so, our approach yields superior performance in few-shot learning scenarios. Notably, our method requires only black-box API access to the LLM, making it easy to integrate into existing training pipelines with minimal computational overhead. Furthermore, we demonstrate how this method can be used to address common issues in real-world datasets, such as skewness and bias. By integrating high-level knowledge from LLMs, our approach improves generalization, even when training data is limited or imbalanced. We validate its effectiveness through extensive experiments across multiple tasks, demonstrating improved learning efficiency and model robustness.
Foundation for unbiased cross-validation of spatio-temporal models for species distribution modeling
Koldasbayeva, Diana, Zaytsev, Alexey
Species Distribution Models (SDMs) often suffer from spatial autocorrelation (SAC), leading to biased performance estimates. We tested cross-validation (CV) strategies - random splits, spatial blocking with varied distances, environmental (ENV) clustering, and a novel spatio-temporal method - under two proposed training schemes: LAST FOLD, widely used in spatial CV at the cost of data loss, and RETRAIN, which maximizes data usage but risks reintroducing SAC. LAST FOLD consistently yielded lower errors and stronger correlations. Spatial blocking at an optimal distance (SP 422) and ENV performed best, achieving Spearman and Pearson correlations of 0.485 and 0.548, respectively, although ENV may be unsuitable for long-term forecasts involving major environmental shifts. A spatio-temporal approach yielded modest benefits in our moderately variable dataset, but may excel with stronger temporal changes. These findings highlight the need to align CV approaches with the spatial and temporal structure of SDM data, ensuring rigorous validation and reliable predictive outcomes.
Gradient Boosting Decision Trees on Medical Diagnosis over Tabular Data
Yıldız, A. Yarkın, Kalayci, Asli
Medical diagnosis is a crucial task in the medical field, in terms of providing accurate classification and respective treatments. Near-precise decisions based on a correct diagnosis can affect a patient's life itself, and a misclassification may, in the extreme, result in a catastrophe. Several traditional machine learning (ML) methods, such as support vector machines (SVMs) and logistic regression, and state-of-the-art tabular deep learning (DL) methods, including TabNet and TabTransformer, have been proposed and used over tabular medical datasets. Additionally, due to their superior performance, lower computational costs, and easier optimization over different tasks, ensemble methods have been used in the field more recently. They offer a powerful alternative for supporting successful medical decision-making in several diagnosis tasks. In this study, we investigated the benefits of ensemble methods, especially the Gradient Boosting Decision Tree (GBDT) algorithms, in medical classification tasks over tabular data, focusing on XGBoost, CatBoost, and LightGBM. The experiments demonstrate that GBDT methods outperform traditional ML and deep neural network architectures and have the highest average rank over several benchmark tabular medical diagnosis datasets. Furthermore, they require much less computational power compared to DL models, making them the optimal methodology in terms of high performance and lower complexity.
AutoML
This may revolutionize data science: we introduce TabPFN, a new tabular data classification method that takes 1 second & yields SOTA performance (competitive with the best AutoML pipelines in an hour). So far, it is limited in scale, though: it can only tackle problems up to 1000 training examples, 100 features and 10 classes. TabPFN is radically different from previous ML methods. It is a meta-learned algorithm and it provably approximates Bayesian inference with a prior for principles of causality and simplicity. TabPFN happens to be a single transformer, but this is not the usual "trees vs nets" battle.
Multi-Perspective Anomaly Detection
Madan, Manav, Jakob, Peter, Schmid-Schirling, Tobias, Valada, Abhinav
Multi-view classification is inspired by the behavior of humans, especially when fine-grained features or in our case rarely occurring anomalies are to be detected. Current contributions point to the problem of how high-dimensional data can be fused. In this work, we build upon the deep support vector data description algorithm and address multi-perspective anomaly detection using three different fusion techniques, i.e., early fusion, late fusion, and late fusion with multiple decoders. We employ different augmentation techniques with a denoising process to deal with scarce one-class data, which further improves the performance (ROC AUC = 80%). Furthermore, we introduce the dices dataset that consists of over 2000 grayscale images of falling dices from multiple perspectives, with 5% of the images containing rare anomalies (e.g. drill holes, sawing, or scratches). We evaluate our approach on the new dices dataset using images from two different perspectives and also benchmark on the standard MNIST dataset. Extensive experiments demonstrate that our proposed approach exceeds the state-of-the-art on both the MNIST and dices datasets. To the best of our knowledge, this is the first work that focuses on addressing multi-perspective anomaly detection in images by jointly using different perspectives together with one single objective function for anomaly detection.
Graph embeddings via matrix factorization for link prediction: smoothing or truncating negatives?
Link prediction -- the process of uncovering missing links in a complex network -- is an important problem in information sciences, with applications ranging from social sciences to molecular biology. Recent advances in neural graph embeddings have proposed an end-to-end way of learning latent vector representations of nodes, with successful application in link prediction tasks. Yet, our understanding of the internal mechanisms of such approaches has been rather limited, and only very recently have we witnessed the development of a very compelling connection to the mature matrix factorization theory. In this work, we make an important contribution to our understanding of the interplay between the skip-gram powered neural graph embedding algorithms and the matrix factorization via SVD. In particular, we show that the link prediction accuracy of graph embeddings strongly depends on the transformations of the original graph co-occurrence matrix that they decompose, sometimes resulting in staggering boosts of accuracy performance on link prediction tasks. Our improved approach to learning low-rank factorization embeddings that incorporate information from unlikely pairs of nodes yields results on par with the state-of-the-art link prediction performance achieved by a complex neural graph embedding model.
ROC Curve in Machine Learning
The Receiver Operating Characteristic (ROC) curve is a popular tool used with binary classifiers. It is very similar to the precision/recall curve. Still, instead of plotting precision versus recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate (FPR). The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to 1 – the true negative rate (TNR), which is the ratio of negative cases that are correctly classified as negative.
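The definitions above can be made concrete with a small sketch in plain Python. The helper names `roc_points` and `trapezoid_auc` and the four-example toy data are illustrative assumptions, not part of the original text. Each distinct score is used as a threshold, the TPR and FPR are computed at that threshold, and the area under the resulting curve is taken with the trapezoid rule.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, strictest first.

    Assumes binary labels (0 = negative, 1 = positive) with at least one
    example of each class.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        # An example is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))  # (FPR, TPR)
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (hypothetical): two negatives, two positives.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
roc_auc = trapezoid_auc(points)
print(points)
print(roc_auc)
```

For this toy data the curve passes through (0, 0), (0, 0.5), (0.5, 0.5), (0.5, 1) and (1, 1), giving an area of 0.75. A random ranking would score about 0.5 and a perfect one 1.0.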
Cost-Sensitive Decision Trees for Imbalanced Classification
The decision tree algorithm is effective for balanced classification, although it does not perform well on imbalanced datasets. The split points of the tree are chosen to best separate examples into two groups with minimum mixing. When both groups are dominated by examples from one class, the criterion used to select a split point will see good separation, when in fact, the examples from the minority class are being ignored. This problem can be overcome by modifying the criterion used to evaluate split points to take the importance of each class into account, referred to generally as the weighted split-point or weighted decision tree. In this tutorial, you will discover the weighted decision tree for imbalanced classification.
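As a rough illustration of a weighted split criterion, the sketch below computes Gini impurity with per-class weights. The choice of Gini, the helper names `weighted_gini` and `split_impurity`, the toy split, and the weight of 19 for the minority class are all assumptions made for illustration; the text above does not say which criterion or weights are used.

```python
def weighted_gini(labels, class_weight):
    """Gini impurity of one group, counting each example with its class weight."""
    total = sum(class_weight[y] for y in labels)
    if total == 0:
        return 0.0
    impurity = 1.0
    for cls in class_weight:
        share = sum(class_weight[y] for y in labels if y == cls) / total
        impurity -= share ** 2
    return impurity


def split_impurity(left, right, class_weight):
    """Impurity of a candidate split: weight-averaged impurity of both groups."""
    w_left = sum(class_weight[y] for y in left)
    w_right = sum(class_weight[y] for y in right)
    return (
        w_left * weighted_gini(left, class_weight)
        + w_right * weighted_gini(right, class_weight)
    ) / (w_left + w_right)


# Hypothetical split of 95 majority (0) and 5 minority (1) examples:
# both groups are dominated by the majority class.
left = [0] * 50
right = [0] * 45 + [1] * 5

unweighted = split_impurity(left, right, {0: 1.0, 1: 1.0})
weighted = split_impurity(left, right, {0: 1.0, 1: 19.0})  # 95 / 5 = 19
print(unweighted, weighted)
```

Without weights the split looks nearly pure (impurity 0.09), because the five minority examples barely register. With the minority class weighted up, the same split scores as clearly impure, so the tree is pushed to keep searching for a split that actually isolates the minority class.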
Predicting drug properties with parameter-free machine learning: Pareto-Optimal Embedded Modeling (POEM)
Brereton, Andrew E., MacKinnon, Stephen, Safikhani, Zhaleh, Reeves, Shawn, Alwash, Sana, Shahani, Vijay, Windemuth, Andreas
The prediction of absorption, distribution, metabolism, excretion, and toxicity (ADMET) of small molecules from their molecular structure is a central problem in medicinal chemistry with great practical importance in drug discovery. Creating predictive models conventionally requires substantial trial-and-error for the selection of molecular representations, machine learning (ML) algorithms, and hyperparameter tuning. A generally applicable method that performs well on all datasets without tuning would be of great value but is currently lacking. Here, we describe Pareto-Optimal Embedded Modeling (POEM), a similarity-based method for predicting molecular properties. POEM is a non-parametric, supervised ML algorithm developed to generate reliable predictive models without need for optimization. POEM's predictive strength is obtained by combining multiple different representations of molecular structures in a context-specific manner, while maintaining low dimensionality. We benchmark POEM relative to industry-standard ML algorithms and published results across 17 classification tasks. POEM performs well in all cases and reduces the risk of overfitting.
Cost-Sensitive Logistic Regression for Imbalanced Classification
Logistic regression does not support imbalanced classification directly. Instead, the training algorithm used to fit the logistic regression model must be modified to take the skewed distribution into account. This can be achieved by specifying a class weighting configuration that is used to influence the amount that logistic regression coefficients are updated during training. The weighting can penalize the model less for errors made on examples from the majority class and penalize the model more for errors made on examples from the minority class. The result is a version of logistic regression that performs better on imbalanced classification tasks, generally referred to as cost-sensitive or weighted logistic regression.
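The effect of class weighting on coefficient updates can be sketched with a tiny one-feature logistic regression trained by gradient descent. Everything here is an illustrative assumption rather than the method of the text above: the function names, the toy data, the learning rate, the epoch count, and the weight of 9 for the minority class (chosen as the inverse class frequency of the toy data).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_weighted_logreg(xs, ys, class_weight, lr=0.1, epochs=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient descent on weighted log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            # The class weight scales each example's error, so errors on
            # heavily weighted classes move the coefficients further.
            err = class_weight[y] * (sigmoid(w * x + b) - y)
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical data: 9 negatives, 1 positive. At x = 1 the classes overlap
# (three negatives, one positive).
xs = [0.0] * 6 + [1.0] * 3 + [1.0]
ys = [0] * 6 + [0] * 3 + [1]

w_plain, b_plain = fit_weighted_logreg(xs, ys, {0: 1.0, 1: 1.0})
w_cost, b_cost = fit_weighted_logreg(xs, ys, {0: 1.0, 1: 9.0})

p_plain = sigmoid(w_plain * 1.0 + b_plain)  # predicted P(y=1 | x=1), unweighted
p_cost = sigmoid(w_cost * 1.0 + b_cost)     # same prediction, cost-sensitive
print(p_plain, p_cost)
```

In the unweighted fit the overlapping region at x = 1 is assigned to the majority class, since three of its four examples are negative. With the minority class weighted up, the single positive example outweighs the three negatives and the predicted probability at x = 1 rises above 0.5.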